The following material has been adapted from the 2023 Swiss Institute of Bioinformatics single cell RNA sequencing course (https://sib-swiss.github.io/single-cell-training/) authored by Tania Wyss, Rachel Marcone-Jeitziner, Geert van Geest, and Patricia Palagi
library(Seurat)
## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, will retire in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
## The sp package is now running under evolution status 2
## (status 2 uses the sf package in place of rgdal)
## Attaching SeuratObject
load("filtered_seu_pbmmc_dataset.RData")
Typical preprocessing steps are normalisation, variable gene selection, and scaling. Preprocessing starts with the filtered Seurat object. You can run the following steps repeatedly and Seurat will overwrite previous results each time. For example, if you scale using 2000 genes but then decide to rerun with 3000 genes then the scale.data slot will be replaced with results for 3000 genes.
?NormalizeData
# Note the different normalisation methods
seu <- Seurat::NormalizeData(seu,assay = "RNA",
normalization.method = "LogNormalize",
scale.factor = 10000)
?FindVariableFeatures
# Note the different selection methods and the nfeatures argument
seu <- Seurat::FindVariableFeatures(seu,assay = "RNA",
selection.method = "vst",
nfeatures = 2000)
?ScaleData
The features argument specifies the genes that are being scaled. If you want to run on all genes then use rownames(seu). The default is to only scale the variable features. The ScaleData function is prone to memory problems (vector limit exhausted) so often you will not be able to run it on all genes.
# Run ScaleData on all genes in the object
seu <- Seurat::ScaleData(seu,assay = "RNA",
features = rownames(seu))
## Centering and scaling data matrix
# Run ScaleData on only the variable features using the VariableFeatures function
seu <- Seurat::ScaleData(seu,assay = "RNA",
features = VariableFeatures(seu))
## Centering and scaling data matrix
Sometimes it can be useful to plot the variable features. You can use the VariableFeatures function to get the top variable features and plot them using VariableFeaturePlot.
# The top 10 variable features are extracted using head()
top10 <- head(Seurat::VariableFeatures(seu), 10)
# See what the top 10 variable features are
top10
## [1] "IGKC" "HBG2" "IGHG3" "IGHG1" "JCHAIN" "HBG1" "IGHA1" "IGHGP"
## [9] "IGLC2" "IGLC3"
# Here we run the VariableFeaturePlot function but save the output to an object
vf_plot <- Seurat::VariableFeaturePlot(seu)
#
# Then we use the LabelPoints function to label the top10 variable features
# The repel argument stops the gene names from being put on top of one another
Seurat::LabelPoints(plot = vf_plot,
points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
PCA is the major dimensionality reduction step. By default, RunPCA computes the top 50 PCs but we will compute the top 100. By default it uses scaled data for the top variable genes identified previously. After you run RunPCA, it prints out the top 30 genes from the top 5 principal components and it is good to check these genes. Are there any cell markers?
seu <- Seurat::RunPCA(seu,npcs = 100)
## PC_ 1
## Positive: TUBB, TUBA1B, KIAA0101, HMGB2, HMGB1, NUSAP1, HIST1H4C, SMC4, TOP2A, H2AFZ
## TYMS, HMGN2, DEK, MKI67, RRM2, BIRC5, PCNA, PTTG1, DUT, CDK1
## STMN1, CCNA2, RAN, CKS1B, CENPF, UBE2T, CENPU, ZWINT, SMC2, NUCKS1
## Negative: IL32, LTB, IFIT1B, TRAC, KRT1, IFITM1, CD3G, TRBC1, TRBC2, BPGM
## FTL, FCN1, CD27, CD8B, COTL1, S100A12, S100A10, ARL4A, CD2, VCAN
## TNFAIP3, TYROBP, S100A11, CD14, GIMAP4, TMCC2, FCER1G, SERPINA1, MNDA, IL7R
## PC_ 2
## Positive: VIM, ACTG1, CD74, GAPDH, RPS2, HLA-DRA, PTMA, HLA-DRB1, SRGN, MT-CO3
## AIF1, AP1S2, TYROBP, FCER1G, HNRNPA1, CST3, RPL37A, AC090498.1, LGALS1, LST1
## CTSS, RPL23, S100A11, PKM, GSTP1, MNDA, ENO1, ANXA1, HLA-DQB1, SPI1
## Negative: ALAS2, SLC4A1, SLC25A37, HBM, GYPA, AHSP, CA1, GYPB, HBD, HEMGN
## IFI27, FECH, HMBS, HBA1, BLVRB, EPB42, SELENBP1, RHAG, HBB, HBA2
## CA2, SMIM1, KLF1, FAM210B, PRDX2, RHCE, TSPO2, RFESD, STOM, BPGM
## PC_ 3
## Positive: LYZ, CD36, MNDA, TYROBP, FCN1, S100A9, CST3, FCER1G, S100A4, VCAN
## S100A8, S100A11, S100A6, LGALS1, CSTA, TYMP, PSAP, LST1, S100A12, FGL2
## SERPINA1, CD14, CFD, CXCL8, MS4A6A, KLF4, SRGN, FTH1, BLVRB, CLEC12A
## Negative: CD24, IGLL1, SOX4, VPREB1, VPREB3, CD79B, ZNF90, CD79A, CMTM8, TCL1A
## LEF1, PDLIM1, IGLL5, EBF1, PTMA, LDHB, CD9, AC090498.1, LTB, STMN1
## MSI2, RCSD1, NPM1, MME, FAM129C, TIFA, PARP1, POU2AF1, SSBP2, EEF1B2
## PC_ 4
## Positive: CD79B, CD24, TCL1A, CD9, VPREB3, UBE2C, IGHM, TOP2A, NUSAP1, VPREB1
## IGLL5, KIAA0226L, CD79A, AURKB, CENPA, CFAP73, MKI67, PLK1, GTSE1, RGS2
## FAM129C, IGLL1, CDCA8, HLA-DRA, PCDH9, IRF4, IGLC5, ASPM, HMMR, TIFA
## Negative: HSP90AB1, FAM178B, APOC1, CNRIP1, PRSS57, RPS2, RPL37A, EEF1B2, KCNH2, RPL23
## RPL7A, HSPD1, CYTL1, NME4, SYNGR1, RPLP0, AKR1C3, EPCAM, NPM1, LDHB
## MYC, AC090498.1, HSPE1, MIF, FKBP4, C1QBP, NME1, MLLT3, SRM, EBPL
## PC_ 5
## Positive: IFITM1, IL32, TRAC, CD3G, TRBC1, TRBC2, AC090498.1, LTB, RPL37A, CD2
## IL7R, RPS2, TNFAIP3, CD8B, CD27, RPL23, CENPE, MT-CO3, ASPM, HMMR
## ITM2A, RORA, GIMAP4, DEPDC1, DLGAP5, PLK1, KIF23, AURKA, CENPA, TRAT1
## Negative: IFIT1B, BPGM, TRIM58, TMCC2, KRT1, ARL4A, SLC25A37, CLIC2, HLA-DRA, ALAS2
## EIF1AY, LPIN2, XPO7, HBB, IFI27, HBD, HBA2, HBA1, SELENBP1, SLC40A1
## LGALS3, FECH, ARG1, SLC2A1, CD74, IGHM, MOSPD1, MZB1, IGKC, ANKRD9
#
Dimensionality reductions can be explored using the DimPlot and FeaturePlot functions. DimPlot is used to overlay categorical variables and FeaturePlot numeric variables (eg metadata variables or genes). You can specify the reduction (pca,umap,tsne) using the reduction argument. If you don’t specify a reduction then DimPlot will use whatever reductions you have in your object in the following order of preference - umap, tsne, and pca.
# DimPlot of the PCA coloured by the active.ident
Seurat::DimPlot(seu, reduction = "pca")
# DimPlot of the PCA coloured by phase
Seurat::DimPlot(seu, reduction = "pca",group.by = "Phase")
# FeaturePlot of the PCA coloured by percent.mito and the HBB gene
Seurat::FeaturePlot(seu, reduction = "pca", features = c("percent.mito","HBB"))
#
This is never an easy question to answer and if you asked 10 different people you might get 10 different answers. Generally speaking, a number of somewhere between 15-50 PCs is common. Here are some of the plots that you can do to help make your decision.
Heatmaps are helpful tools for visualising the top genes for each principal component. The Seurat DimHeatmap function is used below to plot the top 12 and bottom 12 PCs. The dims argument specifies the PCs to use, cells allows you to plot just the top n cells for each principal component (including all cells can be very slow), balanced means that the plots show equal numbers of positive and negative genes (genes up and down for each PC). Another key argument is “fast”. If fast is TRUE then DimHeatmap uses “image” rather than “ggplot2” to generate the plot. “image” is faster but it means that the plot is not customisable.
# Plot the top 12 PCs
Seurat::DimHeatmap(seu, dims = 1:12, cells = 500, balanced = TRUE)
# Plot the bottom 12 PCs
Seurat::DimHeatmap(seu, dims = 89:100, cells = 500, balanced = TRUE)
The signal drops off after a certain number of principal components and essentially becomes meaningless noise. An elbow plot is a valuable tool for showing the amount of extra variation explained by each PC. I like to plot 100 PCs mainly for this plot to get a better idea of when it flattens out (the elbow point)
Seurat::ElbowPlot(seu, ndims = 100)
The plot is very flat after 50 PCs so let’s just plot the first 50 instead to zoom in on where the “elbow” of the plot is
Seurat::ElbowPlot(seu, ndims = 50)
Generally using too many PCs doesn’t matter that much, but using too few can be a problem. It flattens off between 20 and 30 so we will go for 25.
There are other options such as JackStrawPlot that provide more of a statistically driven approach for deciding on how many PCs to use but the ElbowPlot provides a simple and fast way of doing it.
We make a UMAP using the RunUMAP function and specify the number of PCs using the dims argument. There are a lot of arguments for the RunUMAP function. The main ones to be aware of are dims, n.neighbors, and min.dist. Every time you run RunUMAP, the contents of the reduction slot is replaced unless you change the reduction.name argument. If you wanted to you could save a separate reduction in the Seurat object for different numbers of PCs
# make a UMAP using the first 25 PCs
seu <- Seurat::RunUMAP(seu, dims = 1:25)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 11:56:18 UMAP embedding parameters a = 0.9922 b = 1.112
## 11:56:18 Read 6830 rows and found 25 numeric columns
## 11:56:18 Using Annoy for neighbor search, n_neighbors = 30
## 11:56:18 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 11:56:18 Writing NN index file to temp file /var/folders/mp/p7swk1jn6y32qtcbnwszqm680000gp/T//RtmpB2gdYd/filee1294d97f367
## 11:56:18 Searching Annoy index using 1 thread, search_k = 3000
## 11:56:19 Annoy recall = 100%
## 11:56:19 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 11:56:20 Initializing from normalized Laplacian + noise (using irlba)
## 11:56:20 Commencing optimization for 500 epochs, with 284056 positive edges
## 11:56:26 Optimization finished
# Plot the UMAP
Seurat::DimPlot(seu, reduction = "umap")
This is clearly stated on the UMAP package website as well. If you run RunUMAP on the same dataset repeatedly you will not get the same exact result every time even though the same seed is used by RunUMAP. The reasons for this are complex and relate to things such as how many cores your computer is using. Regardless of the exact cause, you should ALWAYS save your Seurat object or the embeddings. You don’t want to try to regenerate results sometime next year only to find that the UMAP is different and your clusters have changed
# let's save the plot of the first 25 PCs as an object
plot25 <- Seurat::DimPlot(seu, reduction = "umap")
# Make a UMAP using just the first 5 PCs
seu <- Seurat::RunUMAP(seu, dims = 1:5, verbose = F)
plot5 <- Seurat::DimPlot(seu, reduction = "umap")
# Make a UMAP using just the first 50 PCs
seu <- Seurat::RunUMAP(seu, dims = 1:50, verbose = F)
plot50 <- Seurat::DimPlot(seu, reduction = "umap")
# plot them one at a time
plot5
plot25
plot50
# or all at the same time
# plot5 + plot25 + plot50
The same functions can be used as for plotting on PCA. First, we need to rerun UMAP to make sure that the UMAP in the Seurat object has been made using the number of PCs that we actually want
seu <- Seurat::RunUMAP(seu, dims = 1:25, verbose = F)
# DimPlot of the UMAP coloured by phase
Seurat::DimPlot(seu, reduction = "umap",group.by = "Phase")
# FeaturePlot of the PCA coloured by some T-cell genes
Seurat::FeaturePlot(seu, reduction = "umap", features = c("CD3D","CD3E"))
# FeaturePlot of the PCA coloured by percent.globin and the HBB gene
Seurat::FeaturePlot(seu, reduction = "umap", features = c("percent.globin","HBB"))
You can add UMAP embeddings using the CreateDimReducObject function. This is particularly useful if you have made a UMAP elsewhere (eg Python). You can also export UMAP or PCA embeddings using the Embeddings function if you want to do further analysis outside of R
# Get the pca embeddings directly from a Seurat object
pca_embeddings <- Embeddings(object = seu,reduction = "pca")
# Get the umap embeddings directly from a Seurat object
umap_embeddings <- Embeddings(object = seu, reduction = "umap")
# don't run - example code only
# seu[['umap']] <- CreateDimReducObject(embeddings = umap_embeddings)
save(seu,file = "seu_preprocess_dimreduction.RData")
#########################################################################################
#########################################################################################
SCTransform performs normalisation, gene selection, and scaling all at the same time. It is the latest preferred approach by the Seurat authors but it is more opaque with many arguments; including many arguments that call other functions. This means that it isn’t easy to see exactly what it is doing without performing some fairly deep digging. SCTransform creates a separate assay in the Seurat object called “SCT”. If you are going to use SCTransform then it makes sense to use the most recent version (SCTransform v2). To run v2, you need to have the SCTransform package installed of at least version 0.3.3 or greater. For SCTransform there are some key arguments such as a min_cells argument (makes sense to keep consistent with CreateSeuratObject) that has a default value of 5. There is also a “clip.ranges” argument that removes genes with outlier residuals. This means that the SCT assay can have fewer genes than the RNA assay.
# install the glmGamPoi Bioconductor package - BiocManager::install("glmGamPoi")
seu <- SCTransform(seu, assay = "RNA", verbose = TRUE,vst.flavor = "v2",min_cells=3,
return.only.var.genes = FALSE,variable.features.n = 3000)
Stop and have a look at the seurat object
seu
## An object of class Seurat
## 37226 features across 6830 samples within 2 assays
## Active assay: SCT (18553 features, 3000 variable features)
## 1 other assay present: RNA
## 2 dimensional reductions calculated: pca, umap
SCTransform makes a separate assay (SCT). SCTransform normalisation is good for clustering but there is still debate about which counts are best to use for differential expression analysis most people will use the counts from NormalizeData() for differential expression ### Changing default assay SCTransform changes the default assay to SCT. We will change it back to RNA using the DefaultAssay function so that the UMAP plots below use the same assay as those from the main body of the session
DefaultAssay(seu) <- "RNA"
# You can make a vector of gene names and pass that to FeaturePlot instead
genes_int <- c("HBA1","HBA2","HBB")
Seurat::FeaturePlot(seu, reduction = "pca", features = genes_int)
Make a UMAP using just the first 50 PCs and plot it
seu <- Seurat::RunUMAP(seu, dims = 1:50,reduction.name = "umap50",verbose = F)
## Warning: Cannot add objects with duplicate keys (offending key: UMAP_), setting
## key to 'umap50_'
Seurat::DimPlot(seu, reduction = "umap50")
plot25 was made during the main session above
seu <- Seurat::RunUMAP(seu, dims = 1:25,n.neighbors = 5,verbose = F)
plot25_n5 <- Seurat::DimPlot(seu, reduction = "umap")
plot25_n5 + plot25
# min.dist 0.1
seu <- Seurat::RunUMAP(seu, dims = 1:25,min.dist = 0.1,verbose = F)
plot25_min0.1 <- Seurat::DimPlot(seu, reduction = "umap")
# min.dist 0.5
seu <- Seurat::RunUMAP(seu, dims = 1:25,min.dist = 0.5,verbose = F)
plot25_min0.5 <- Seurat::DimPlot(seu, reduction = "umap")
# min.dist 0.1
plot25_min0.1
# plot25 was made during the main session above - uses the default of 0.3
plot25
# min.dist 0.5
plot25_min0.5